Docket No : DE920000064US1 
Inventor : BAUER 

Title : METHOD AND SYSTEM FOR 

CASE CONVERSION 



APPLICATION FOR UNITED STATES 
LETTERS PATENT 



"Express Mail" Mailing Label No.: EK830785600US 
Date of Deposit: August 21, 2001 



I hereby certify that this paper is being 
deposited with the United States Postal Service 
as "Express Mail Post Office to Addressee" service 
under 37 CFR 1.10 on the date indicated above 
and is addressed to: Box Patent Application, 
Assistant Commissioner for Patents, Washington, 
D.C. 20231. 



Name: Ann S. Lund 
Signature: 



INTERNATIONAL BUSINESS MACHINES CORPORATION



METHOD AND SYSTEM FOR CASE CONVERSION 



BACKGROUND OF THE INVENTION 

1. Field of the Invention

The present invention relates to a method and system for converting a first set of elements into a 
second set of elements. More particularly, the present invention relates to a method and system 
for case conversion, i.e., characters having a particular property, such as lowercase, uppercase or 
titlecase, are converted into characters having a different one of such properties.

2. Description of the Related Art 

Companies often develop an initial version of a system or program that deals with just one
particular language, e.g., English. Normally, it is only a matter of time until a version of the
system or program is needed that can handle a different language. A common approach is still to
go through all the lines of code and translate the literal strings.

This might be an acceptable approach if the system or program is needed in only one
additional language, since the translation is a time-consuming process. Not all literal strings
need to be translated, so the translation process requires human judgment. Moreover,
each new version of the system or program needs to be prepared in the same way, costing
resources, time and money. In addition, since the company ends up having multiple versions of the
program code, maintenance and support become more expensive as well, because every
change of the program code needs to be applied to each of the different language versions. There
is also the danger that a translator may introduce bugs by mistakenly modifying
code.

More and more, companies address the aforementioned multi-lingual issue earlier in system
design. A general technique to internationalize systems and programs is to separate literal strings
from the program code so that the program code never needs modification when adapting
the program to different languages. This can be achieved by providing separate files containing the
translatable information. However, this needs to be addressed during the program design or it
requires a number of modifications to the code.

All translatable strings need to be moved into separate files, so-called resource files, and the
program code needs to be changed so that it can access those strings when needed.
These resource files can be flat text files, databases, or even code resources, but they are
completely separate from the main code, and contain nothing but the translatable data.

Programs having such changes applied meet the fundamental requirements to function in
different international environments. In order to localize such a system or program, i.e., to adapt
such a system or program to requirements of a different country, only the resource files need to be
translated. Thus, no changes in the program code need be involved. It might not even be
necessary to have programmers doing the translation. The resource files might just be handed
over to a translation agency to modify.

However, this only solves one aspect of the multi-lingual issue, namely, how to provide a system
or program with a translation of labels, menus or user messages. Another issue is to display the
translated strings on the screen. As long as the same character set can be used for the different
languages, this might be straightforward. However, different European languages already use
quite a few different characters besides the widely used characters 'a' to 'z'. Furthermore, there
are languages that do not use the Latin alphabet at all, such as the Slavic languages
that use the Cyrillic alphabet or the Greek language, which uses the Greek alphabet, and so on.

In order to solve this issue, different character sets are needed, which used to be encoded using
different codepages. Nowadays, internationalized systems and programs use a universal
character encoding standard, such as ISO/IEC 10646 (International Organization for
Standardization / International Electrotechnical Commission) or the Unicode Standard.
By using such a standard, a single internationalization process can be implemented that handles
the requirements of all the world markets at the same time. Since such a standard provides a single
definition for each character, it handles the characters for all the world markets in a uniform way
and avoids the complexities of different character code architectures.

Now, the thus prepared systems and programs can handle different translations of labels, menus 
and user messages. They can display such messages in the appropriate character set and are able 
to store all literal information without a danger of data corruption because of mixed up character 
sets. However, there is still more functionality needed to provide full internationalization. 

A majority of systems and programs, in particular word processors, databases and search
engines, need to provide the functionality of case conversion. A "case" is a feature of certain
alphabets where the letters have two distinct forms. These variants, which may differ markedly in
shape and size, are called 'uppercase letter', also known as 'capital' or 'majuscule', and
'lowercase letter', also known as 'small' or 'minuscule.' Hence, case is a normative property of
characters. Besides the properties uppercase and lowercase, a third one is distinguished in case
conversion; it is called 'titlecase.' 'Titlecase' means an uppercased initial letter followed by
lowercase letters in a word. This is a convention often used in titles, headers, and entries, as for
example in a dictionary, glossary or a table of contents.

Case conversion, however, is not trivial, since letters that look alike might have to be treated
differently depending on the particular language. This is because of their particular case mapping,
i.e., the association of the uppercase form, lowercase form, and titlecase form of a letter. Particular
characters may expand to two characters when converted to uppercase, they may have different
case mappings depending on the context, or they may have case mappings that differ from
language to language.
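The three difficulties can be observed with the default, locale-independent case mappings built into Python's `str` type. This is only an illustration of the phenomena, not of the method described in this application; the sample strings and variable names are chosen for the example.

```python
# One-to-many mapping: a single character expands to two when uppercased.
sharp_s = "\u00df"                     # Latin Small Letter Sharp S
expanded = sharp_s.upper()             # two Latin Capital Letter S

# Context-dependent mapping: capital sigma lowercases differently at a word end.
word = "\u039f\u0394\u039f\u03a3"      # Greek capitals omicron, delta, omicron, sigma
lowered = word.lower()                 # last letter becomes final sigma, U+03C2
single = "\u03a3".lower()              # a stand-alone sigma becomes U+03C3

# Language-dependent mapping: the default mapping is not Turkish-aware, so
# capital I lowercases to dotted i rather than dotless i (U+0131).
default_i = "I".lower()
```

A Turkish-aware conversion therefore needs information beyond the default mapping, which is the kind of rule the special casing table described later is meant to carry.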

A state of the art approach addresses the aforementioned issue by doing the case conversion
character by character, with the special cases hard-coded. For each character it is checked
whether a different conversion is needed because of the language or position of the character
under consideration.

From US 6,055,365 a method is known of using a computer to translate a source text whose
glyphs and control codes are represented by a string of code points from a set of source code
points to a destination text whose glyphs and control codes are represented by a string of code
points from a set of destination code points. The method comprises the steps of accessing a
translation state table, whereby the translation state table has at least one row of cells and each
row has an associated state value. The cells are indexed by the source code points. A
current state is used to select a row of the translation state table. Then, an input code point
sequence from the source text is used to select a cell within the row. If the cell contains a next
state value, then the steps of using the current state and of using an input code point sequence are
repeated until a desired destination code point sequence is provided. Later, the current state is
updated with a next state value, and finally, the steps of using a current state, using an input code
point sequence, and repeating are repeated for each next input code point sequence.

The described method teaches implementing a general purpose state machine as a computer
program. The general purpose state machine needs a lookup step for every single byte in the input
stream to determine the next state. This creates a lot of overhead that slows down the processing.

It is therefore an object of the present invention to provide a method and a system having an 
improved processing speed. 

SUMMARY OF THE INVENTION 

The foregoing object is achieved by a method and a system for converting a first set of elements
into a second set of elements, whereby at least one element of the first set has a context dependent
relation to one or more elements of the second set according to the independent claims. The
expression 'context' not only refers to elements before and after the element under consideration,
but also to the whole surroundings that give meaning to the conversion process. For example, in
case characters need to get converted, the context might also be the language the characters are
used in, or the encoding scheme being used.

The focus of the invention is on speed. Therefore, the method and system seek to utilize basic
functionality for translating elements already provided on a computer system to be used in
conjunction with the present invention. The provided basic functionality for translating elements,
in general, is simple but fast.

The present invention makes use of a standard translation function. The function that is used
within the method and system according to the present invention is able to translate a block of
elements of a first set into a block of elements of a second set. However, the provided function is
only able to handle a static relation between the elements of the first and the second set, i.e., an
element of the first set gets translated into one particular element of the second set under all
circumstances. In case a different treatment is required, the function needs to interrupt its
processing and raise an exception. The relation between the elements of the first and the second
set is provided to the function in the form of a table specifying for each element of the first set either
one particular element of the second set or an exception handling element in case no static relation
exists. Whenever one element of the first set would be translated to such an exception handling
element, the function interrupts and an exception handling function gets executed.
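The behaviour described above can be sketched in software as follows. This is a minimal illustration and not the hardware instruction itself; the names `STOP`, `translate_block` and `on_exception`, the use of a dictionary as the table, and the placeholder handler are all assumptions made for the example.

```python
STOP = object()  # hypothetical marker standing for the exception handling element

def translate_block(block, table, on_exception):
    """Translate a block of code points; defer to on_exception at a STOP entry."""
    out = []
    for pos, code in enumerate(block):
        target = table.get(code, code)    # assumption: unlisted codes map to themselves
        if target is STOP:
            # No static relation exists: interrupt and run the exception handler.
            out.extend(on_exception(block, pos))
        else:
            out.append(target)
    return out

# Static relations for 'A' and 'B'; Greek capital sigma needs special treatment.
table = {0x0041: 0x0061, 0x0042: 0x0062, 0x03A3: STOP}
handler = lambda block, pos: [0x03C3]     # placeholder rule for the example
result = translate_block([0x0041, 0x03A3, 0x0042], table, handler)
```

Only the entry marked `STOP` leaves the fast path; every other element is resolved by a single table lookup.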

Preferably, the function is implemented as a machine instruction, i.e., a function that gets
processed at the hardware level of a computer. This makes the computation of the instruction
much faster than a software implementation. Such a function that converts a whole batch of
characters with one call, for example, exists on the S/390 hardware platform manufactured by
International Business Machines Corporation. On this hardware platform the particular function is
called TRTT (Translate Two to Two).

However, since the required function only provides a simple translation process, a software
implementation, e.g., in machine code, i.e., a representation of a computer program that is
actually read and interpreted by a computer, might be sufficiently fast.



In order to exploit the simple but fast translation function provided by the computer system to be 
used, according to the present invention, the first set of elements is split into a first subset 
consisting of those elements that are translated to one particular element of the second set and into
a second subset consisting of the remaining elements of the first set. A first table is composed in 
which each element belonging to the first subset is assigned to the respective element of the 
second set and all elements of the second subset are assigned to an exception handling element. A 
second table is composed representing rules according to which an exception handling function 
translates the elements of the second subset. A block of data to be converted is determined, 
whereby the data is formed by elements of the first set. Then, the first and the second table and 
the determined block of data are provided to the translation function. Finally, the translation 
function is processed. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The above, as well as additional objectives, features and advantages of the present invention, will 
be apparent in the following detailed written description. 

The novel features of the invention are set forth in the appended claims. The invention itself, 
however, as well as a preferred mode of use, further objectives, and advantages thereof, will best 
be understood by reference to the following detailed description of an illustrative embodiment 
when read in conjunction with the accompanying drawings, wherein: 

Fig. 1 illustrates a generation of a first table being used in the method and system according to the 
present invention; 

Fig. 2 shows a flow chart depicting a first mode of operation of the method and system according
to the present invention;

Fig. 3 shows a flow chart depicting a second mode of operation of the method and system
according to the present invention;

Fig. 4 shows a detailed view of a table defining special rules for context dependent case
conversion; and

Fig. 5 illustrates the generation of the table of Fig. 4. 

DETAILED DESCRIPTION OF THE INVENTION 

With reference to Fig. 1, there is depicted a first chart 100 having a first column 102, a second 
column 104 and a third column 106. The chart 100 defines a case conversion for different 
characters.

In the first column 102, the glyphs of all characters to be converted are depicted. A glyph is an
image used in the visual representation of characters. The characters 'A' and 'B' in the first
column 102 are only cited as an example. The dots in the first and the fourth row indicate that the
chart is actually much larger, covering all characters needed.

The second column lists the hexadecimal code of the characters 'A' and 'B', i.e., the
representation of the respective characters in a given format. In the present illustration the
characters A and B are encoded in a universal character encoding standard, following the
ISO/IEC 10646 standard (International Organization for Standardization / International
Electrotechnical Commission) and the Unicode Standard, respectively.

Finally, the third column shows the hexadecimal code of the lowercase representation of the
respective character A or B. In other words, whenever the character A having the hexadecimal
code x0041 is meant to be converted into its lowercase representation, it has to be replaced by the
hexadecimal code x0061. Of course, this is only true if the same encoding standard is being used.
However, in case other encoding standards are used, a corresponding chart is provided. This chart
cannot directly be used for an automated character conversion in accordance with the method and
system for converting a first set of elements into a second set of elements provided by the present
invention. Therefore, starting from the first chart 100 a first table 110 is composed as indicated by
the arrow 112.

The first table 110 consists of a first column 114 and a second column 116. The first column lists
the addresses of a linear block of memory cells and the second column 116 lists the contents of the
respective memory cells. The first table 110 is now generated in such a way that the code of a
lowercase representation of a character is stored in the field of the second column that is indicated
by the address that corresponds to the encoding of the character. In other words, the codes of the
characters to be converted are interpreted as addresses of a linear block of memory cells and the
code representing the result of the case conversion is stored in the respective memory cell. For
example, the hexadecimal code x0041 encoding the character 'A' now represents the address of a
memory cell that contains the lowercase representation x0061 of the character 'A' in the given
universal encoding standard.
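The composition of the first table 110 from the first chart 100 can be sketched as below. This is an illustration under stated assumptions: the chart excerpt, the name `compose_table`, the 16-bit table size, and the choice to pre-fill the table with the identity mapping for codes the chart does not list are not taken from the original text.

```python
from array import array

# Hypothetical excerpt of chart 100: (glyph, code, lowercase code).
chart_100 = [("A", 0x0041, 0x0061), ("B", 0x0042, 0x0062)]

def compose_table(chart, size=0x10000):
    """Build a linear block of 16-bit cells addressed by character code."""
    # Assumption: codes absent from the chart keep their own value.
    table = array("H", range(size))
    for _glyph, code, lower in chart:
        table[code] = lower   # the code is the address, the mapping is the content
    return table

table_110 = compose_table(chart_100)
lower_a = table_110[0x0041]   # a single indexed read performs the conversion
```

Because the character code itself is the index, converting a character needs no search or comparison.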

Thus, through the processing of the first chart 100 a block of memory cells is obtained that
contains the information about the character conversion originally stored in the first chart 100.
Tables specifying the character conversion to uppercase and to titlecase are composed in the
same way. Of course, different charts need to be provided. Such charts can normally be acquired
from the institutions setting up the respective universal encoding standards.

Now with reference to Fig. 2, there is depicted a flow chart showing a first mode of operation of
the method and system according to the present invention. The block 200 illustrates a translation
function provided by a computer system to be used in conjunction with the present invention. The
translation function is able to convert a batch of characters with one call. The batch of characters
is provided to the translation function by specifying the respective addresses where the batch of
characters can be found. This is indicated by the first arrow 202.

In order to instruct the function how to translate the provided batch of data, a previously
composed table 204 is provided to the translation function. The table 204 corresponds to the first
table 110 shown in Fig. 1. Alternatively, a different table 206 can be provided to the translation
function, instructing it to perform a different conversion. The table 204 enables the translation
function to convert the inputted batch of characters into lowercase, whereas the different table
206, for example, would instruct the translation function to convert the provided batch of
characters to uppercase. Finally, if the end of the supplied batch of characters, here the source, is
reached, the results are available for further processing, as indicated by the second arrow 208.
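The first mode of operation can be imitated with Python's built-in `str.translate`, which likewise converts a whole string in one call according to a supplied table. It is used here only as a software stand-in for the translation function of block 200; the table names and the restriction to ASCII letters are assumptions made to keep the example short.

```python
import string

# Stand-ins for tables 204 and 206, limited to ASCII letters for brevity.
to_lower_204 = {ord(up): ord(low)
                for up, low in zip(string.ascii_uppercase, string.ascii_lowercase)}
to_upper_206 = {low: up for up, low in to_lower_204.items()}

source = "Case Conversion"
lowered = source.translate(to_lower_204)   # one call converts the whole batch
raised = source.translate(to_upper_206)    # same function, different table
```

Exchanging the table is the only change needed to switch from lowercase to uppercase conversion.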

Up to now, the first mode of operation dealing with a basic scenario of case conversion has been
described. In the basic scenario, a character is converted to one particular character under all
circumstances. Case conversion, however, is not so trivial. Depending on the particular language,
letters that look alike may have to be treated differently.

Characters may expand to two characters when converted to uppercase. For example, the German
character "ß", referred to as 'Latin Small Letter Sharp S', expands to the sequence of two
characters 'Latin Capital Letter S'.

Characters may have different case mappings, depending on the context. For example, the Greek
character "Σ", 'Greek Capital Letter Sigma', has a first lowercase representation "σ", 'Greek
Small Letter Sigma', if it is followed by another letter, and a second lowercase representation "ς",
'Greek Small Letter Final Sigma', if it is the last letter in a word.

Furthermore, characters may have case mappings that depend on the language. For example, in
the Turkish language the letter 'Latin Capital Letter I' has the lowercase representation
'Latin Small Letter Dotless I', whereas the letter 'Latin Small Letter I' has the
uppercase representation 'Capital Letter I With Dot Above'.

With reference to Fig. 3, a flow chart depicting a second mode of operation of the method and
system according to the present invention is shown. In this mode of operation the method and
system also deal with such characters that need a context dependent conversion.

The block 300 illustrates again a translation function provided by a computer system. The
translation function is able to convert a batch of characters provided to the function, as indicated
by a first arrow 302, with one call. The conversion is performed in accordance with a previously
composed first table 304 provided to the translation function.

The first table 304 corresponds to the table 110 shown in Fig. 1, but shows some additional
features. The table consists of a first and a second column. The first column lists the addresses of
a linear block of memory cells and the second column lists the contents of the respective memory
cells, as already described in greater detail with reference to Fig. 1. In the first table 304, the
content of a memory cell is a special exception handling element, referred to as 'stop element',
in such cases in which a context dependent conversion is required. Whenever the translation
function translates a character to the stop element, the translation function interrupts its processing
and an exception handling function is executed, as illustrated by arrow 310. Block 312 illustrates
the exception handling function. The execution of the exception handling function can be invoked
either by the translation function itself or explicitly as part of the inventive method.

A previously composed second table 314 represents rules according to which the exception handling
function translates the characters requiring a context dependent conversion. After having
determined the correct, context specific conversion, the exception handling function is terminated
and the control of the process is returned to the translation function, as depicted by the arrow
316. The previously described processing steps are repeated automatically by the translation
function until the whole batch of characters has been converted. If the end of the source is
reached, the translation function terminates and returns the converted batch of characters for
further processing, as indicated by the arrow 318.
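The second mode of operation, with the exception handling function invoked explicitly and control then returned to the translation function, might be sketched as follows. All names, the value chosen for the stop element, the dictionary layout of the two tables, and the simplified test for the end of a word are assumptions made for this illustration.

```python
STOP = 0xFFFF  # hypothetical code reserved as the stop element

def translate_until_stop(src, start, table, out):
    """Translate from start; return the index of the first stop element, or len(src)."""
    for i in range(start, len(src)):
        mapped = table.get(src[i], src[i])
        if mapped == STOP:
            return i
        out.append(mapped)
    return len(src)

def handle_exception(src, i, rules):
    """Resolve src[i] by context: 'final' if no letter follows (simplified check)."""
    following = chr(src[i + 1]) if i + 1 < len(src) else " "
    condition = "other" if following.isalpha() else "final"
    return rules[src[i]][condition]

def convert(src, table, rules):
    out, pos = [], 0
    while pos < len(src):
        pos = translate_until_stop(src, pos, table, out)
        if pos < len(src):                 # interrupted at a stop element
            out.extend(handle_exception(src, pos, rules))
            pos += 1                       # resume after the handled character
    return out

# Stand-ins for the first table 304 and the second table 314.
table_304 = {0x039F: 0x03BF, 0x0394: 0x03B4, 0x03A3: STOP}
rules_314 = {0x03A3: {"final": [0x03C2], "other": [0x03C3]}}
word = convert([0x039F, 0x0394, 0x039F, 0x03A3], table_304, rules_314)
```

The inner loop stays a plain table lookup; context is examined only for the characters that the first table marks with the stop element.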

Fig. 4 shows a detailed view of a special casing table 400. The special casing table 400
corresponds to the second table 314 shown in Fig. 3. The expression 'special casing' refers to the
rules according to which all context dependent characters get converted. The table consists of
eleven columns, eleven rows and the column titles. It is acknowledged that the shown table forms
only a small part of all the special casing needed. Further, the particular representation of the
information as shown in the table is only one possible way of representing it, e.g., the rows or
columns can be arranged differently or the comments and column titles can be omitted altogether. The
dots in rows 1, 3, 6 and 11 indicate that other rows were not drawn purely for the sake of clarity.

The first column contains the code of a source character. This is the character that is meant to be 
converted. The second column indicates the number of bytes of a lowercase mapping, whereas the 
third column specifies the code of the lowercase mapping. Correspondingly, the fourth column 
indicates the number of bytes of a titlecase mapping, the fifth column specifies the code of the 
titlecase mapping, the sixth column indicates the number of bytes of an uppercase mapping and 
the seventh column specifies the code of the uppercase mapping. The eighth column contains a 
country code. A language code is provided in the ninth column. The tenth column keeps a 
condition list and, finally, the eleventh column provides some comments. 
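One possible in-memory form of a row of the special casing table is sketched below. The class and field names are hypothetical, and the byte counts of columns two, four and six are derived from the mappings instead of being stored, assuming two bytes per code as in the example of the second row.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SpecialCasingRow:
    source: int                        # column 1: code of the source character
    lower: Tuple[int, ...]             # columns 2-3: lowercase mapping
    title: Tuple[int, ...]             # columns 4-5: titlecase mapping
    upper: Tuple[int, ...]             # columns 6-7: uppercase mapping
    country: str = ""                  # column 8: country code
    language: str = ""                 # column 9: language code
    conditions: Tuple[str, ...] = ()   # column 10: condition list
    comment: str = ""                  # column 11: comments

    @staticmethod
    def byte_length(mapping, bytes_per_code=2):
        # Assumption: every code occupies two bytes.
        return len(mapping) * bytes_per_code

# The sharp s row as described for the second row of table 400.
sharp_s = SpecialCasingRow(0x00DF, (0x00DF,), (0x0053, 0x0053), (0x0053, 0x0053),
                           comment="Latin Small Letter Sharp S")
```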

With reference to the second row, there is depicted an example of a character expanding to two
characters when converted to uppercase. The hexadecimal code x00DF encodes the German
character "ß", referred to as 'Latin Small Letter Sharp S'. The lowercase mapping is identically
encoded in two bytes, since it is already lowercase. In uppercase or titlecase it expands to the
sequence of two characters 'Latin Capital Letter S' encoded as x0053, x0053, now having a length
of four bytes.

If a character has different case mappings depending on particular conditions, more than
one row is required for the same character to represent the conversion rules, one for each
condition. The fourth and fifth rows show the example of the Greek character 'Greek Capital
Letter Sigma', having the hexadecimal code x03A3. The fourth row shows the scenario in which the
character is the last one in a word, as indicated by the condition 'final'. In this scenario the
character gets converted to its lowercase representation "ς", 'Greek Small Letter Final Sigma', having
the hexadecimal code x03C2. If the letter is not the last one in a word, its lowercase
representation "σ", 'Greek Small Letter Sigma', having the hexadecimal code x03C3, is
used.

The seventh and the ninth row show an example wherein common Latin capital and small letters
need to be treated differently because of the language they occur in. In the Turkish language the
letter 'Latin Capital Letter I' having the hexadecimal code x0049 has the lowercase
representation 'Latin Small Letter Dotless I' having the hexadecimal code x0131, whereas the
letter 'Latin Small Letter I' with the hexadecimal code x0069 has the uppercase
representation 'Capital Letter I With Dot Above' having the hexadecimal code x0130. Since
this is only true for the Turkish language, the country code shows 'TR'. In the English language,
for example, 'Latin Capital Letter I' having the hexadecimal code x0049 would be converted to
'Latin Small Letter I' with the hexadecimal code x0069 when changing to lowercase and vice
versa, as shown in rows eight and ten.
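How an exception handling function might choose between such rows can be sketched as follows. The tuple layout, the function name `lookup`, the convention that an empty country code marks the default rule, and the fallback order are assumptions for this example; the codes are the ones given in the text.

```python
# Hypothetical excerpt: (source code, country code, lowercase code, uppercase code).
# An empty country code marks the default rule.
rows = [
    (0x0049, "TR", 0x0131, 0x0049),   # Turkish: I -> dotless i
    (0x0049, "",   0x0069, 0x0049),   # default: I -> i
    (0x0069, "TR", 0x0069, 0x0130),   # Turkish: i -> I with dot above
    (0x0069, "",   0x0069, 0x0049),   # default: i -> I
]

def lookup(code, country, to_upper=False):
    """Prefer a row matching the country code, otherwise use the default row."""
    for wanted in (country, ""):
        for source, row_country, lower, upper in rows:
            if source == code and row_country == wanted:
                return upper if to_upper else lower
    return code   # no special casing known: leave the character unchanged

turkish_lower = lookup(0x0049, "TR")
default_lower = lookup(0x0049, "")
```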

Finally, with reference to Fig. 5, there is illustrated the generation of the special casing table as
shown in Fig. 4. A first chart 500 having three columns lists all codes of characters to be
translated and the codes of their lowercase mapping. In a second chart 502, there is a list of
special casing. In addition to the columns shown in the first chart 500, in the second chart 502 a
column 'condition' can be found indicating the condition for a special casing.

A first lowercase mapping for the Greek character "Σ", 'Greek Capital Letter Sigma', is encoded
by the hexadecimal code x03C3 standing for "σ", 'Greek Small Letter Sigma'. However, there is
a second lowercase mapping for this character in the special casing chart 502. If the character to
be converted is the last one in a word, as indicated by the condition 'final', then a different
lowercase mapping is needed, here, hexadecimal code x03C2, standing for "ς", 'Greek Small
Letter Final Sigma'.

Now, a first table 504 and a second table 506 are composed from the information given in the
aforementioned charts 500 and 502. The first table 504 contains all information of a regular
treatment for a case conversion to lowercase. From the first chart the second column is taken that
lists the codes of all different characters that might be converted, as indicated by arrow 507. Then,
as a lowercase mapping, a 'stop' code is assigned for all characters that have an entry in the
second chart specifying the special casing conditions, as indicated by arrow 508. For example, the
hexadecimal code x03A3 has two different lowercase mappings, as already mentioned.
Therefore, there is the 'stop' in the same row. Consequently, the information from the first chart
500 and the special casing information from the second chart 502 are written to the second table
506, as indicated by arrows 510 and 512. Characters with only one lowercase mapping only show
an entry in the first table 504.
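The generation of the two tables from the two charts can be sketched as below. The dictionary and tuple layouts, the function name `compose`, and the string used as stop code are assumptions for this illustration; it also assumes that every character listed in the special casing chart appears in the first chart.

```python
STOP = "stop"   # hypothetical stop code

# Hypothetical excerpts of chart 500 (code -> lowercase code) and
# chart 502 (code, lowercase code, condition).
chart_500 = {0x0041: 0x0061, 0x03A3: 0x03C3}
chart_502 = [(0x03A3, 0x03C2, "final")]

def compose(regular, special):
    special_codes = {code for code, _lower, _condition in special}
    # Table 504: the regular mapping, or the stop code where special casing exists.
    table_504 = {code: (STOP if code in special_codes else lower)
                 for code, lower in regular.items()}
    # Table 506: the unconditional mapping from chart 500 plus each conditional one.
    table_506 = [(code, regular[code], "") for code in sorted(special_codes)]
    table_506 += list(special)
    return table_504, table_506

table_504, table_506 = compose(chart_500, chart_502)
```

With this split, new special cases are added by extending the charts and regenerating the tables, without touching the conversion code.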

Besides the characters that have different casings, there are also so-called 'uncased' characters,
i.e., characters that never change in a case conversion, such as whitespace, i.e., any contiguous
sequence of spaces, tabs, carriage returns, and/or line feeds, comma, full stop, semicolon, etc.

In another embodiment of the present invention the uncased characters are used to implement a
table driven character conversion to titlecase. In a conversion to titlecase only characters at the
beginning of a word get converted to uppercase. Before starting the conversion process by calling
a standard translation function, as explained above, a specially composed table is provided to the
translation function. In this special table the contents fields of all rows marked by codes of uncased
characters are filled with a stop element indicating the need for special treatment. Whenever a
character gets translated to a stop element, an exception handling function is called. Then, the
exception handling function can determine the next cased character, i.e., the opposite of an
uncased character, and perform the conversion to uppercase. Hence, just by providing a different
table a whole batch of characters can be converted to titlecase with one call to the translation
function.
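A sketch of this titlecase embodiment follows. It rests on several assumptions that the text does not state: regular table entries hold the lowercase mapping, the start of the batch is treated like a word boundary, and the tables are restricted to ASCII letters and a handful of uncased characters. All names are hypothetical.

```python
import string

STOP = object()   # hypothetical stop element

lower_map = dict(zip(string.ascii_uppercase, string.ascii_lowercase))
upper_map = {low: up for up, low in lower_map.items()}
table = dict(lower_map)
table.update({ch: STOP for ch in " \t\r\n,.;"})   # uncased characters stop the run

def capitalize_next_cased(text, i, out):
    """Exception handler: copy uncased characters, then uppercase the next cased one."""
    while i < len(text) and table.get(text[i], text[i]) is STOP:
        out.append(text[i])
        i += 1
    if i < len(text):
        out.append(upper_map.get(text[i], text[i]))
        i += 1
    return i

def to_titlecase(text):
    out = []
    i = capitalize_next_cased(text, 0, out)   # assumption: batch start begins a word
    while i < len(text):
        mapped = table.get(text[i], text[i])
        if mapped is STOP:
            out.append(text[i])
            i = capitalize_next_cased(text, i + 1, out)
        else:
            out.append(mapped)
            i += 1
    return "".join(out)

title = to_titlecase("method AND  system, for CASE conversion")
```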

Another major advantage of the method and system according to the present invention lies in the
fact that the translation function and the exception handling function can stay unchanged when
new case mappings come up. Advantageously, no information about the treatment of
characters during case conversion is hard-coded, i.e., written directly into a program, possibly in
multiple places, where it cannot be easily modified.

The present invention can be realized in hardware, software, or a combination of hardware and
software. Any kind of computer system - or other apparatus adapted for carrying out the methods
described herein - is suited. A typical combination of hardware and software could be a general
purpose computer system with a computer program that, when being loaded and executed,
controls the computer system such that it carries out the methods described herein. The present
invention can also be embedded in a computer program product, which comprises all the features
enabling the implementation of the methods described herein, and which - when loaded in a
computer system - is able to carry out these methods.

Computer program means or computer program in the present context mean any expression, in
any language, code or notation, of a set of instructions intended to cause a system having an
information processing capability to perform a particular function either directly or after either or
both of the following: a) conversion to another language, code or notation; b) reproduction in a
different material form.

Further, the present invention can advantageously be incorporated at least partly in a hardware
implementation directly burnt into an integrated circuit, such as a hardware chip. The integrated
circuit then comprises hardware implementing and reflecting at least parts of the steps of the
inventive code conversion method. Considering the steadily growing diversity of
telecommunication devices and their steadily increasing function range, including more and more
technical features, such a chip can then be used in a large variety of devices. In view of devices
available today, such a chip can be advantageously used in any device which forms part of any
international communication, for example, Internet servers, routers in any kind of network, e.g.,
the Internet, set-top boxes for TV or radio receiving devices, particularly digital TV or radio,
mobile phones, any kind of hand held computing and/or telecommunication device or any other
device having an input interface for processing any foreign-language data.